

 phoneme sequence



Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach

Tu, Huu Tuong, Khanh, Ha Viet, Dat, Tran Tien, Huan, Vu, Van Luong, Thien, Cuong, Nguyen Tien, Trang, Nguyen Thi Thu

arXiv.org Artificial Intelligence

Mispronunciation Detection and Diagnosis (MDD) is crucial for language learning and speech therapy. Unlike conventional methods that require scoring models or phoneme-level model training, we propose a novel training-free framework that leverages retrieval techniques with a pre-trained Automatic Speech Recognition model. Our method avoids phoneme-specific modeling and additional task-specific training while still achieving accurate detection and diagnosis of pronunciation errors. Experiments on the L2-ARCTIC dataset show that our method achieves a superior F1 score of 69.60% while avoiding the complexity of model training.

Index Terms: mispronunciation detection and diagnosis, retrieval-based methods, training-free framework, automatic pronunciation assessment
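
The abstract does not spell out the retrieval procedure, but the general training-free idea of checking a learner's pronunciation against a canonical target can be illustrated with a simple phoneme alignment. The Python sketch below aligns the phoneme sequence recognized by a pre-trained ASR model against the canonical sequence and reports insertions, deletions, and substitutions as detected errors; the ARPAbet-style example is invented, and this is an illustration of the general idea rather than the authors' retrieval method.

```python
# Minimal sketch (not the authors' exact method): align the phoneme sequence
# recognized by a pre-trained ASR model against the canonical phoneme sequence
# of the prompt, and flag insertions/deletions/substitutions as mispronunciations.

def align_phonemes(canonical, recognized):
    """Levenshtein alignment with backtrace; returns a list of edit operations."""
    n, m = len(canonical), len(recognized)
    # dp[i][j] = edit distance between canonical[:i] and recognized[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if canonical[i - 1] == recognized[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    # Backtrace to recover the operation at each aligned position.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + (canonical[i - 1] != recognized[j - 1])):
            ops.append(("match" if canonical[i - 1] == recognized[j - 1] else "substitution",
                        canonical[i - 1], recognized[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("deletion", canonical[i - 1], None))
            i -= 1
        else:
            ops.append(("insertion", None, recognized[j - 1]))
            j -= 1
    return list(reversed(ops))

# Example: a learner says "sink" for "think" -> /TH/ substituted by /S/.
canonical = ["TH", "IH", "NG", "K"]
recognized = ["S", "IH", "NG", "K"]
errors = [op for op in align_phonemes(canonical, recognized) if op[0] != "match"]
print(errors)  # [('substitution', 'TH', 'S')]
```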


Reconstructing Unseen Sentences from Speech-related Biosignals for Open-vocabulary Neural Communication

Kim, Deok-Seon, Lee, Seo-Hyun, Yin, Kang, Lee, Seong-Whan

arXiv.org Artificial Intelligence

Brain-to-speech (BTS) systems represent a groundbreaking approach to human communication by enabling the direct transformation of neural activity into linguistic expressions. While recent non-invasive BTS studies have largely focused on decoding predefined words or sentences, achieving open-vocabulary neural communication comparable to natural human interaction requires decoding unconstrained speech. Additionally, effectively integrating diverse signals derived from speech is crucial for developing personalized and adaptive neural communication and rehabilitation solutions for patients. This study investigates the potential of speech synthesis for previously unseen sentences across various speech modes by leveraging phoneme-level information extracted from high-density electroencephalography (EEG) signals, both independently and in conjunction with electromyography (EMG) signals. Furthermore, we examine the properties affecting phoneme decoding accuracy during sentence reconstruction and offer neurophysiological insights to further enhance EEG decoding for more effective neural communication solutions. Our findings underscore the feasibility of biosignal-based sentence-level speech synthesis for reconstructing unseen sentences, highlighting a significant step toward developing open-vocabulary neural communication systems adapted to diverse patient needs and conditions. Additionally, this study provides meaningful insights into the development of communication and rehabilitation solutions utilizing EEG-based decoding technologies.
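
As a rough illustration of the kind of phoneme-level biosignal decoding the study describes, the PyTorch sketch below defines a small classifier that produces per-frame phoneme logits from EEG features, optionally fused with EMG features. All layer sizes, signal dimensions, and the fusion scheme are assumptions for illustration and are not the study's actual model.

```python
import torch
import torch.nn as nn

class BiosignalPhonemeDecoder(nn.Module):
    """Toy phoneme decoder over EEG features, with an optional EMG branch."""
    def __init__(self, eeg_dim=64, emg_dim=8, hidden=128, num_phonemes=40):
        super().__init__()
        self.eeg_branch = nn.GRU(eeg_dim, hidden, batch_first=True)
        self.emg_branch = nn.GRU(emg_dim, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden, num_phonemes)

    def forward(self, eeg, emg=None):
        # eeg: (B, T, eeg_dim); emg: optional (B, T, emg_dim)
        eeg_feat, _ = self.eeg_branch(eeg)
        if emg is None:
            emg_feat = torch.zeros_like(eeg_feat)   # EEG-only decoding
        else:
            emg_feat, _ = self.emg_branch(emg)      # EEG + EMG fusion
        # Per-frame phoneme logits; a downstream synthesizer would consume these.
        return self.head(torch.cat([eeg_feat, emg_feat], dim=-1))

model = BiosignalPhonemeDecoder()
eeg_only = model(torch.randn(2, 200, 64))                       # EEG alone
fused = model(torch.randn(2, 200, 64), torch.randn(2, 200, 8))  # EEG + EMG
```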



TSPC: A Two-Stage Phoneme-Centric Architecture for code-switching Vietnamese-English Speech Recognition

Nguyen, Minh N. H., Tran, Anh Nguyen, Dinh, Dung Truong, Van Vo, Nam

arXiv.org Artificial Intelligence

Code-switching (CS) presents a significant challenge for general Automatic Speech Recognition (ASR) systems. Existing methods often fail to capture the subtle phonological shifts inherent in CS scenarios. The challenge is particularly difficult for language pairs like Vietnamese and English, where both distinct phonological features and ambiguity between similar-sounding units are present. In this paper, we propose a novel architecture for Vietnamese-English CS ASR, the Two-Stage Phoneme-Centric model (TSPC). TSPC employs a phoneme-centric approach built upon an extended Vietnamese phoneme set as an intermediate representation to facilitate mixed-lingual modeling. Experimental results demonstrate that TSPC consistently outperforms existing baselines, including PhoWhisper-base, in Vietnamese-English CS ASR, achieving a significantly lower word error rate of 19.9% with reduced training resources. Furthermore, the phoneme-based two-stage architecture enables phoneme adaptation and language conversion to enhance ASR performance in complex Vietnamese-English CS scenarios.
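
As a toy illustration of the two-stage phoneme-centric idea (not the TSPC implementation, whose models and phoneme inventory are not given in the abstract), the Python sketch below stubs out Stage 1 with a fixed phoneme output and implements Stage 2 as a greedy longest-match lexicon lookup that turns a mixed-lingual phoneme string into Vietnamese and English words. The phonemes and lexicon are rough, invented approximations.

```python
def stage1_speech_to_phonemes(audio_features):
    """Stub acoustic model: in TSPC this would be a neural speech-to-phoneme model."""
    # Pretend the model recognized the code-switched phrase "toi dung laptop".
    return ["t", "oj", "z", "u", "ng", "l", "a", "p", "t", "o", "p"]

def stage2_phonemes_to_text(phonemes, lexicon):
    """Greedy longest-match lookup standing in for the neural phoneme-to-text stage."""
    words, i = [], 0
    while i < len(phonemes):
        for j in range(len(phonemes), i, -1):   # try the longest span first
            key = tuple(phonemes[i:j])
            if key in lexicon:
                words.append(lexicon[key])
                i = j
                break
        else:
            i += 1                              # skip an unmappable phoneme
    return " ".join(words)

lexicon = {
    ("t", "oj"): "tôi",                         # Vietnamese word
    ("z", "u", "ng"): "dùng",                   # Vietnamese word
    ("l", "a", "p", "t", "o", "p"): "laptop",   # embedded English word
}
print(stage2_phonemes_to_text(stage1_speech_to_phonemes(None), lexicon))
# -> "tôi dùng laptop"
```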


Phoneme-Level Visual Speech Recognition via Point-Visual Fusion and Language Model Reconstruction

Teng, Matthew Kit Khinn, Zhang, Haibo, Saitoh, Takeshi

arXiv.org Artificial Intelligence

Visual Automatic Speech Recognition (V-ASR) is a challenging task that involves interpreting spoken language solely from visual information, such as lip movements and facial expressions. The task is notably difficult due to the absence of auditory cues and the visual ambiguity of phonemes that share similar visemes (distinct sounds that appear identical in lip motion). Existing methods often aim to predict words or characters directly from visual cues, but they commonly suffer from high error rates due to viseme ambiguity and require large amounts of pre-training data. To address these challenges, we propose a novel phoneme-based two-stage framework that fuses visual and landmark motion features, followed by a language model for word reconstruction. Stage 1 consists of V-ASR, which outputs the predicted phonemes, thereby reducing training complexity; the facial landmark features account for speaker-specific facial characteristics. Stage 2 comprises an encoder-decoder language model, NLLB, that reconstructs the output phonemes back into words. In addition to fine-tuning on a large visual dataset, our PV-ASR method demonstrates superior performance, achieving 17.4% WER on the LRS2 dataset and 21.0% WER on the LRS3 dataset.
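
A minimal PyTorch sketch of the Stage 1 idea is shown below: frame-level visual features and facial-landmark features are concatenated, encoded, and mapped to per-frame phoneme logits (with a CTC-style blank); Stage 2 would then pass the collapsed phoneme string to a sequence-to-sequence model such as NLLB for word reconstruction. Feature dimensions, layers, and the phoneme inventory size are assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class PhonemePredictor(nn.Module):
    """Toy Stage 1: fuse visual and landmark streams, predict frame-level phonemes."""
    def __init__(self, visual_dim=512, landmark_dim=136, hidden=256, num_phonemes=40):
        super().__init__()
        self.fuse = nn.Linear(visual_dim + landmark_dim, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_phonemes + 1)  # +1 for the CTC blank

    def forward(self, visual, landmarks):
        # visual:    (B, T, visual_dim)  CNN features of lip crops
        # landmarks: (B, T, landmark_dim) flattened facial keypoints
        x = torch.relu(self.fuse(torch.cat([visual, landmarks], dim=-1)))
        x, _ = self.encoder(x)
        return self.head(x)  # (B, T, num_phonemes + 1) frame-level phoneme logits

model = PhonemePredictor()
logits = model(torch.randn(2, 75, 512), torch.randn(2, 75, 136))
phoneme_ids = logits.argmax(-1)  # greedy frame labels; Stage 2 would map the
                                 # collapsed phoneme sequence back to words.
```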


LLM-based phoneme-to-grapheme for phoneme-based speech recognition

Ma, Te, Bi, Min, Yusuyin, Saierdaer, Huang, Hao, Ou, Zhijian

arXiv.org Artificial Intelligence

In automatic speech recognition (ASR), phoneme-based multilingual pre-training with crosslingual fine-tuning is attractive for its high data efficiency and competitive results compared to subword-based models. However, Weighted Finite State Transducer (WFST)-based decoding is limited by its complex pipeline and its inability to leverage large language models (LLMs). Therefore, we propose LLM-based phoneme-to-grapheme (LLM-P2G) decoding for phoneme-based ASR, consisting of speech-to-phoneme (S2P) and phoneme-to-grapheme (P2G) stages. A challenge is that cascading S2P and P2G appears to introduce information loss. To address this, we propose two training strategies: data augmentation with noisy phonemes (DANP) and randomized top-K marginalized (TKM) training and decoding. Our experimental results show that LLM-P2G outperforms WFST-based systems in crosslingual ASR for Polish and German, with relative WER reductions of 3.6% and 6.9%, respectively.
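
The top-K marginalization mentioned above can be pictured as summing candidate-text scores over the S2P model's K-best phoneme hypotheses. The sketch below assumes hypothetical s2p_topk and p2g interfaces that return log-probabilities; it illustrates only the marginalization step, not the randomized training procedure or the actual LLM-P2G models.

```python
import math
from collections import defaultdict

def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == float("-inf"):
        return b
    if b == float("-inf"):
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def tkm_decode(s2p_topk, p2g):
    """
    s2p_topk: list of (phoneme_sequence, log_prob) pairs from the S2P K-best list.
    p2g:      callable returning a list of (text, log_prob) pairs for a phoneme sequence.
    Returns the text whose score, marginalized over phoneme hypotheses, is highest.
    """
    scores = defaultdict(lambda: float("-inf"))
    for phonemes, lp_phon in s2p_topk:
        for text, lp_text in p2g(phonemes):
            # log p(phonemes | audio) + log p(text | phonemes), accumulated per text
            scores[text] = log_add(scores[text], lp_phon + lp_text)
    return max(scores, key=scores.get)
```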


PHISH in MESH: Korean Adversarial Phonetic Substitution and Phonetic-Semantic Feature Integration Defense

Kim, Byungjun, Kim, Minju, Park, Hyeonchu, Kim, Bugeun

arXiv.org Artificial Intelligence

As malicious users increasingly employ phonetic substitution to evade hate speech detection, researchers have investigated such strategies. However, two key challenges remain. First, existing studies have overlooked the Korean language, despite its vulnerability to phonetic perturbations due to its phonographic nature. Second, prior work has primarily focused on constructing datasets rather than developing architectural defenses. To address these challenges, we propose (1) PHonetic-Informed Substitution for Hangul (PHISH) that exploits the phonological characteristics of the Korean writing system, and (2) Mixed Encoding of Semantic-pHonetic features (MESH) that enhances the detector's robustness by incorporating phonetic information at the architectural level. Our experimental results demonstrate the effectiveness of our proposed methods on both perturbed and unperturbed datasets, suggesting that they not only improve detection performance but also reflect realistic adversarial behaviors employed by malicious users.
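
A minimal sketch of the mixed semantic-phonetic encoding idea (not the authors' exact MESH architecture) is given below: a sentence-level semantic embedding from a pre-trained encoder is concatenated with a phonetic embedding computed over decomposed Hangul jamo, so phonetically perturbed spellings stay close to their unperturbed forms. Embedding sizes, the jamo vocabulary, and the encoders are placeholders.

```python
import torch
import torch.nn as nn

class MixedEncodingClassifier(nn.Module):
    """Toy detector that fuses semantic and jamo-level phonetic representations."""
    def __init__(self, semantic_dim=768, jamo_vocab=70, phon_dim=128, num_classes=2):
        super().__init__()
        self.jamo_embed = nn.Embedding(jamo_vocab, phon_dim, padding_idx=0)
        self.phon_encoder = nn.GRU(phon_dim, phon_dim, batch_first=True)
        self.classifier = nn.Linear(semantic_dim + phon_dim, num_classes)

    def forward(self, semantic_vec, jamo_ids):
        # semantic_vec: (B, semantic_dim) from a pre-trained text encoder
        # jamo_ids:     (B, L) indices of decomposed Hangul jamo characters
        _, h = self.phon_encoder(self.jamo_embed(jamo_ids))
        phonetic_vec = h[-1]                                   # (B, phon_dim)
        fused = torch.cat([semantic_vec, phonetic_vec], dim=-1)
        return self.classifier(fused)

model = MixedEncodingClassifier()
logits = model(torch.randn(4, 768), torch.randint(1, 70, (4, 30)))
```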


Enhancing Listened Speech Decoding from EEG via Parallel Phoneme Sequence Prediction

Lee, Jihwan, Feng, Tiantian, Kommineni, Aditya, Kadiri, Sudarsana Reddy, Narayanan, Shrikanth

arXiv.org Artificial Intelligence

Brain-computer interfaces (BCIs) offer numerous human-centered application possibilities, particularly for people with neurological disorders. Decoding text or speech from brain activity is a relevant domain that could improve the quality of life of people with impaired speech perception. We propose a novel approach to enhance listened speech decoding from electroencephalography (EEG) signals by utilizing an auxiliary phoneme predictor that simultaneously decodes textual phoneme sequences. The proposed model architecture consists of three main parts: an EEG module, a speech module, and a phoneme predictor. The EEG module learns to represent EEG signals as EEG embeddings. The speech module generates speech waveforms from the EEG embeddings. The phoneme predictor outputs the decoded phoneme sequences in the text modality. Our approach allows users to obtain decoded listened speech from EEG signals in both modalities (speech waveforms and textual phoneme sequences) simultaneously, eliminating the need for a concatenated sequential pipeline for each modality. The proposed approach also outperforms previous methods in both modalities. The source code and speech samples are publicly available.
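
The three-part layout described above can be sketched as a shared EEG encoder with two parallel heads, one producing speech-frame features and one producing phoneme logits. The PyTorch sketch below uses invented layer sizes and a simple convolutional EEG module purely for illustration; the paper's actual modules differ.

```python
import torch
import torch.nn as nn

class ParallelEEGDecoder(nn.Module):
    """Toy layout: shared EEG module with parallel speech and phoneme heads."""
    def __init__(self, eeg_channels=64, hidden=256, num_phonemes=40, frame_dim=80):
        super().__init__()
        self.eeg_module = nn.Sequential(
            nn.Conv1d(eeg_channels, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.speech_module = nn.Linear(hidden, frame_dim)              # e.g. mel frames for a vocoder
        self.phoneme_predictor = nn.Linear(hidden, num_phonemes + 1)   # +1 for a CTC blank

    def forward(self, eeg):
        # eeg: (B, channels, T)
        emb = self.eeg_module(eeg).transpose(1, 2)    # (B, T, hidden) shared EEG embeddings
        speech_frames = self.speech_module(emb)       # decoded speech representation
        phoneme_logits = self.phoneme_predictor(emb)  # parallel textual phoneme sequence
        return speech_frames, phoneme_logits

model = ParallelEEGDecoder()
speech, phonemes = model(torch.randn(2, 64, 500))
```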


SLIDE: Integrating Speech Language Model with LLM for Spontaneous Spoken Dialogue Generation

Lu, Haitian, Cheng, Gaofeng, Luo, Liuping, Zhang, Leying, Qian, Yanmin, Zhang, Pengyuan

arXiv.org Artificial Intelligence

Recently, ``textless" speech language models (SLMs) based on speech units have made huge progress in generating naturalistic speech, including non-verbal vocalizations. However, the generated speech samples often lack semantic coherence. In this paper, we propose SLM and LLM Integration for spontaneous spoken Dialogue gEneration (SLIDE). Specifically, we first utilize an LLM to generate the textual content of spoken dialogue. Next, we convert the textual dialogues into phoneme sequences and use a two-tower transformer-based duration predictor to predict the duration of each phoneme. Finally, an SLM conditioned on the spoken phoneme sequences is used to vocalize the textual dialogue. Experimental results on the Fisher dataset demonstrate that our system can generate naturalistic spoken dialogue while maintaining high semantic coherence.